Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


Use “gzip -d” to decompress a compressed file.

gzip -d SRR030834.fastq.gz

If you need to know the number of records in a FASTQ file, you can combine “cat” (or “zcat” for a compressed file) with “wc -l”, which counts the number of lines in a text file. Remember that a record in a FASTQ file has four lines, so the number of records is the line count divided by four. We can use the Unix pipe symbol “|” to pass the output of the “cat” command to the “wc -l” command. The following command line counts the number of records stored in the FASTQ file:

cat SRR030834.fastq | echo $((`wc -l`/4))
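
The same count can be obtained from a compressed file without decompressing it on disk. The following is a minimal sketch, assuming a small made-up file named “demo.fastq” (not one of the files used in this chapter); “gzip -dc” writes the decompressed content to standard output, which is what “zcat” does on most Linux systems.

```shell
# Build a tiny two-record FASTQ file so the example is self-contained.
# demo.fastq is a hypothetical file name used only for illustration.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > demo.fastq

# Compress a copy; -c writes to standard output and keeps the original file.
gzip -c demo.fastq > demo.fastq.gz

# Stream the decompressed lines to wc -l, then divide the line count by 4.
lines=$(gzip -dc demo.fastq.gz | wc -l)
records=$(( lines / 4 ))
echo "$records"
```

Because the decompressed data only passes through the pipe, no uncompressed copy of a large FASTQ file is written to disk.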

To display the file name and read count for every file with the “.fastq” file name extension in a directory, we can use the following script:

# Print each file name, a tab, and its read count (line count divided by 4).
for filename in *.fastq; do
    echo -e "$filename\t$(cat "$filename" | wc -l | awk '{print $1 / 4}')"
done
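
A directory often holds both plain and gzip-compressed FASTQ files. The loop above can be extended to handle both, as in the following sketch. The function name “count_reads”, the “demo_” file names, and the output file “demo_counts.tsv” are assumptions made for this illustration only.

```shell
# Hypothetical demo files so that the loop has something to count.
printf '@r1\nACGT\n+\nIIII\n' > demo_a.fastq
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip -c > demo_b.fastq.gz

count_reads() {
    # Print "<file name><TAB><read count>" for one FASTQ file.
    case "$1" in
        *.gz) n=$(gzip -dc "$1" | wc -l) ;;   # compressed: decompress to the pipe
        *)    n=$(wc -l < "$1") ;;            # plain text: count lines directly
    esac
    printf '%s\t%d\n' "$1" "$(( n / 4 ))"
}

for filename in demo_*.fastq demo_*.fastq.gz; do
    [ -e "$filename" ] || continue   # skip a pattern that matched no file
    count_reads "$filename"
done > demo_counts.tsv

cat demo_counts.tsv
```

Writing the result to a tab-separated file makes it easy to sort the counts or open them in a spreadsheet later.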

To display a FASTQ file in a tabular format, you can use the “cat” command and pipe its output to the “paste” command, which joins the four lines of each FASTQ record into a single tab-separated row.

cat SRR030834.fastq | paste - - - - > SRR030834_tab.txt
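
To see what “paste - - - -” does, it helps to try it on a very small file. The sketch below uses a made-up two-record file (“demo.fastq” and the other “demo” names are hypothetical). It also shows that the conversion can be reversed with “tr”, assuming that the records themselves contain no tab characters.

```shell
# A made-up two-record FASTQ file for illustration.
printf '@r1 len=4\nACGT\n+\nIIII\n@r2 len=4\nGGCC\n+\n!!!!\n' > demo.fastq

# Each "-" reads one line from standard input, so four dashes join
# every four consecutive lines into one row separated by tabs.
paste - - - - < demo.fastq > demo_tab.txt
cat demo_tab.txt

# Reverse the conversion: turn every tab back into a newline.
tr '\t' '\n' < demo_tab.txt > demo_roundtrip.fastq

# cmp is silent and returns success when the two files are identical.
cmp demo.fastq demo_roundtrip.fastq && echo "round trip OK"
```

The tabular file has one row per read, so its line count equals the number of records in the FASTQ file.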

The command stores the tabular output in a new file, “SRR030834_tab.txt”. You can open this file in any spreadsheet program, or you can display it as follows:

less -S SRR030834_tab.txt

Creating a tabular file from a FASTQ file helps us perform several operations, such as sorting the entries, filtering out duplicate reads, extracting read IDs, sequences, or quality scores, and creating a FASTA file. We expect the format of the identifier lines of a FASTQ file to be consistent. If you display “SRR030834_tab.txt”, you will notice that some of the identifier line fields are separated by spaces. If we consider the space as a column separator (by default, “awk” splits fields on both spaces and tabs), the IDs will be in the first column and the sequences will be in the fourth column. However, this column order may be different in tabular files extracted from other FASTQ files. Assume that we wish to extract only the IDs and sequences from “SRR030834_tab.txt” into a separate text file; we can use the “awk” command as follows:

awk '{print $1 "\t" $4}' SRR030834_tab.txt > SRR030834_seq.txt
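
Because the position of the sequence column depends on how many space-separated fields the identifier line has, a more portable variant is to tell “awk” to split on tabs only. The sequence is then always in the second column. The following sketch illustrates this on a made-up tabular file and also writes a FASTA file, one of the operations mentioned above. All “demo” file names are hypothetical.

```shell
# A made-up tabular file with the layout produced by "paste - - - -".
printf '@r1 len=4\tACGT\t+\tIIII\n@r2 len=4\tGGCC\t+\t!!!!\n' > demo_tab.txt

# With -F'\t' the whole identifier line is column 1 and the sequence is column 2.
# split() keeps only the first space-separated word of the identifier (the read ID).
awk -F'\t' '{split($1, id, " "); print id[1] "\t" $2}' demo_tab.txt > demo_seq.txt

# FASTA: replace the leading "@" with ">" and print the sequence on the next line.
awk -F'\t' '{print ">" substr($1, 2); print $2}' demo_tab.txt > demo.fasta

cat demo_seq.txt
cat demo.fasta
```

With the tab as the only separator, the same commands work regardless of how many fields the identifier lines of a particular FASTQ file contain.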